feat(results): add export() method and --output-format CLI flag #540
przemekboruta wants to merge 7 commits into NVIDIA-NeMo:main from
Conversation
Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet. The CLI create command gains --output-format / -f which writes dataset.<format> alongside the parquet batch files.
Greptile Summary

This PR adds a streaming `export()` for `DatasetCreationResults` and an `--output-format` flag on the create CLI command.
| Filename | Overview |
|---|---|
| packages/data-designer/src/data_designer/interface/results.py | Adds ExportFormat Literal, SUPPORTED_EXPORT_FORMATS constant, count_records() method, and streaming export() with per-format helpers; logic is correct across all three format paths. |
| packages/data-designer/src/data_designer/cli/controllers/generation_controller.py | Replaces load_dataset() with count_records() for record reporting and adds post-generation export block; CLI-level click.Choice validation prevents invalid formats before any work begins. |
| packages/data-designer/src/data_designer/cli/commands/create.py | Adds --output-format / -f typer option with click.Choice validation derived from SUPPORTED_EXPORT_FORMATS; correctly forwarded to run_create(). |
| packages/data-designer/tests/interface/test_results.py | Comprehensive parametrized tests covering all three formats, format inference, explicit override, schema unification, empty-batch error, and round-trip correctness. |
| packages/data-designer/tests/cli/controllers/test_generation_controller.py | Adds happy-path and failure-path tests for export via CLI controller; mock helpers updated from load_dataset to count_records. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as create_command (CLI)
    participant GC as GenerationController
    participant DD as DataDesigner
    participant Res as DatasetCreationResults
    participant FS as Filesystem (batch_*.parquet)
    CLI->>CLI: click.Choice validates --output-format
    CLI->>GC: run_create(config_source, num_records, output_format)
    GC->>DD: create(config_builder, num_records)
    DD-->>GC: results: DatasetCreationResults
    GC->>Res: count_records()
    Res->>FS: pq.read_metadata(batch_*.parquet)
    FS-->>Res: row counts
    Res-->>GC: total record count
    alt output_format is not None
        GC->>Res: export(base_path / dataset.fmt)
        loop each batch_*.parquet
            Res->>FS: read batch
            Res->>FS: append to output file
        end
        Res-->>GC: Path to exported file
        GC->>CLI: print export path
    end
    GC->>CLI: print_success(N record(s) generated)
```
Reviews (7): Last reviewed commit: "fix(results): address nabinchha review —..."
andreatgretel
left a comment
Thanks for this @przemekboruta - export() fills a real gap and the core implementation is clean. A few things to tighten up before merging:
- Per CONTRIBUTING.md, features should have an associated issue. Could you open one retroactively and link it here with `Closes #NNN`?
- The success message / export ordering and double dataset load need fixing (see inline comments).
- Need controller-level tests for the new `output_format` path: happy path, bad format rejection, and export failure.
- A couple of style guide items flagged inline (error types, import placement).
Left details in inline comments. Nice catch on the lazy import fix in the third commit.
…, import hygiene

- Derive `SUPPORTED_EXPORT_FORMATS` from `get_args(ExportFormat)` so the two can't drift apart
- Replace `ValueError` with `InvalidFileFormatError` in `export()` — consistent with project error conventions
- Add `date_format="iso"` to `to_json()` for consistent datetime serialization across formats
- Add `click.Choice(SUPPORTED_EXPORT_FORMATS)` to `--output-format` CLI option for parse-time validation, better `--help` output, and tab completion
- Fix double `load_dataset()` in `run_create`: inline `len()` so the DataFrame ref dies before export
- Move success message after the export block to avoid "Dataset created" followed by "Export failed"
- Move imports to module level in `test_results.py` (`json`, `Path`; `lazy` already imported)
- Add controller-level tests for `output_format` happy path, bad format rejection, and export failure
hey @andreatgretel! thanks for the review. All points seem to be addressed right now.
Issue #566 has been triaged. The linked issue check is being re-evaluated. |
andreatgretel
left a comment
Thanks @przemekboruta, all review comments are addressed - approving!
We should be able to merge soon. Seeking one more approval - cc @NVIDIA-NeMo/data-designer-reviewers.
Heads up on OOM risk with large datasets

The proposed `export()` materialises the entire dataset in memory (via `load_dataset()`) before writing. The data is already on disk as individual batch files (`batch_*.parquet` under `artifact_storage.final_dataset_path`), so we can instead stream them into the output file one at a time, keeping peak memory proportional to a single batch.

One nuance for the parquet path: batch files can have slightly mismatched schemas (e.g., a column inferred as `int64` in one batch and `float64` in another), so the schemas should be unified upfront before writing.

Example implementation sketch:

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

SUPPORTED_FORMATS = {"jsonl", "csv", "parquet"}


def export(self, path: str | Path) -> Path:
    output = Path(path)
    fmt = output.suffix.lstrip(".")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported file extension {output.suffix!r}. "
            f"Use one of: {', '.join(SUPPORTED_FORMATS)}"
        )
    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("*.parquet"))
    if not batch_files:
        raise FileNotFoundError("No batch parquet files found.")
    if fmt == "jsonl":
        _export_jsonl(batch_files, output)
    elif fmt == "csv":
        _export_csv(batch_files, output)
    elif fmt == "parquet":
        _export_parquet(batch_files, output)
    return output


def _export_jsonl(batch_files: list[Path], output: Path) -> None:
    with output.open("w") as f:
        for batch_file in batch_files:
            chunk = pd.read_parquet(batch_file)
            content = chunk.to_json(orient="records", lines=True)
            f.write(content)
            # guard against pandas versions that omit the trailing newline
            if not content.endswith("\n"):
                f.write("\n")


def _export_csv(batch_files: list[Path], output: Path) -> None:
    for i, batch_file in enumerate(batch_files):
        chunk = pd.read_parquet(batch_file)
        # write mode on the first batch truncates any stale file; append after that
        chunk.to_csv(output, mode="w" if i == 0 else "a", header=(i == 0), index=False)


def _export_parquet(batch_files: list[Path], output: Path) -> None:
    schemas = [pq.read_schema(f) for f in batch_files]
    unified = pa.unify_schemas(schemas)
    with pq.ParquetWriter(output, unified) as writer:
        for batch_file in batch_files:
            table = pq.read_table(batch_file)
            writer.write_table(table.cast(unified))
```

Usage:

```python
results.export("output.jsonl")
results.export("output.csv")
results.export("output.parquet")
```
…atasets

- Rewrite `export()` to read batch parquet files one at a time instead of materialising the full dataset via `load_dataset()`; peak memory is now proportional to a single batch regardless of dataset size
- Infer output format from file extension by default; `format=` parameter kept as an explicit override (e.g. writing `.txt` as JSONL)
- `_export_parquet` unifies schemas across batches (`pa.unify_schemas`) to handle type drift (e.g. `int64` vs `float64` in the same column)
- Drop `format=` from the controller's `export()` call — path already carries the correct extension
- Rewrite export tests around real batch parquet files (`stub_batch_dir` fixture); add tests for multi-batch output, schema unification, unknown extension, empty batch directory, and explicit format override
Hey @nabinchha! I totally agree with your proposal and I've implemented it. Looking forward to your feedback. PR description updated.
nabinchha
left a comment
Thanks @przemekboruta! One blocker on memory safety and a few small wins to tidy up before merge.
Summary
Adds DatasetCreationResults.export(path, format=) (jsonl/csv/parquet) that streams batch parquet files one at a time to bound peak memory, plus --output-format/-f on data-designer create. Matches the stated intent; extension-based inference with explicit override is a nice ergonomic touch.
Findings
Critical — Let's fix these before merge
packages/data-designer/src/data_designer/cli/controllers/generation_controller.py:161 — CLI path still OOMs on large datasets, defeating the streaming export
- What: `num_records = len(results.load_dataset())` reads every batch parquet file into a single in-memory DataFrame just to print the row count, and it runs before `results.export()`. For any dataset big enough to need streaming, the CLI OOMs on this line before the streaming export gets a chance to run.
- Why: The PR's headline property is "peak memory proportional to a single batch regardless of dataset size." The programmatic API honors that — `results.export(...)` called directly from Python is safe. But `data-designer create ... --output-format X` (the most visible entry point, and the one users of #566 are most likely to hit) does not. This isn't a nit: the motivating use case is broken through the CLI.
- Suggestion: Count rows from parquet metadata — no data pages loaded:

  ```python
  batch_files = sorted(results.artifact_storage.final_dataset_path.glob("batch_*.parquet"))
  num_records = sum(lazy.pq.read_metadata(f).num_rows for f in batch_files)
  ```

  Cleaner option: expose a `count_records()` (or similar) helper on `DatasetCreationResults` / `ArtifactStorage` so the CLI doesn't have to know about batch file layout. Either way, this should land in this PR rather than a follow-up, since it's tied to the claim the PR is making.
Warnings — Worth addressing
packages/data-designer/src/data_designer/cli/controllers/generation_controller.py:131-137 — Redundant validation now that click.Choice is in place
- What: `create.py:44` uses `click_type=click.Choice(list(SUPPORTED_EXPORT_FORMATS))`, which means Click rejects any invalid `--output-format` at parse time with its own error message — the `run_create` controller can never be reached with a bad value via the CLI. The controller-level check (and the local import at line 131) is dead code from the CLI's perspective.
- Why: It's a small YAGNI violation (the controller isn't a documented public API; the only caller is `create_command`), and the style guide asks for imports at module level — `create.py` already imports `SUPPORTED_EXPORT_FORMATS` at the top, so lazy-loading doesn't buy anything here either. It also means we maintain two parallel error messages for the same condition.
- Suggestion: Either (a) drop lines 131-137 and the corresponding `test_run_create_invalid_output_format_exits` test, or (b) if we want the controller to remain independently callable with defensive validation, move the import to module level and keep the check. I'd lean toward (a), since `click.Choice` already gives better `--help` output and tab completion.
packages/data-designer/src/data_designer/interface/results.py:229-234 — _export_parquet leaks raw pyarrow errors on genuinely incompatible schemas
- What: `pa.unify_schemas(schemas)` handles numeric widening fine (tested), but raises `pyarrow.lib.ArrowInvalid` if batches have differing column names or truly incompatible types. Likewise `table.cast(unified_schema)` can raise on an unsupported cast. Those escape unwrapped.
- Why: Per AGENTS.md / STYLEGUIDE.md, third-party exceptions at module boundaries should be normalized to `data_designer` error types. Callers writing `except InvalidFileFormatError` (or `except DataDesignerError`) will miss these, and the raw pyarrow traceback leaks an implementation detail of the engine into user-facing error handling.
- Suggestion: Wrap the unify/cast block:

  ```python
  try:
      unified_schema = lazy.pa.unify_schemas(schemas)
  except lazy.pa.lib.ArrowInvalid as e:
      raise InvalidFileFormatError(f"Cannot unify batch schemas for parquet export: {e}") from e
  ```
Suggestions — Take it or leave it
packages/data-designer/src/data_designer/interface/results.py:138 — File extension matching is case-sensitive
- What: `path.suffix.lstrip(".")` returns `"JSONL"` for `output.JSONL`, which fails the format lookup.
- Suggestion: `path.suffix.lstrip(".").lower()` — tiny ergonomic win, no downside.
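A self-contained sketch of the suggested fix; `infer_format` is an illustrative helper name, and plain `ValueError` stands in for the project's `InvalidFileFormatError`:

```python
from __future__ import annotations

from pathlib import Path

SUPPORTED_EXPORT_FORMATS = ("jsonl", "csv", "parquet")


def infer_format(path: str | Path) -> str:
    """Infer the export format from the file extension, case-insensitively."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_EXPORT_FORMATS:
        # ValueError as a stand-in; the PR raises InvalidFileFormatError here
        raise ValueError(f"Unsupported file extension: {Path(path).suffix!r}")
    return fmt
```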
packages/data-designer/src/data_designer/interface/results.py:211-212 — The trailing-newline guard in _export_jsonl is probably unnecessary
- What: `to_json(orient="records", lines=True)` always emits a trailing `\n` in supported pandas versions, so `if not content.endswith("\n")` never fires. Not harmful, just noise.
- Suggestion: Either drop the check, or add a one-line comment noting which pandas behavior it guards against so a future reader doesn't puzzle over it.
What Looks Good
- Schema unification for parquet type drift — the `pa.unify_schemas` + `table.cast` approach is exactly the right move, and the dedicated `test_export_parquet_schema_unification` test (int64 → float64 across batches) nails the motivating case. Nice catch during the rewrite.
- Streaming helpers are cleanly factored — `_export_jsonl` / `_export_csv` / `_export_parquet` are small, single-purpose, and easy to reason about in isolation. The `export()` dispatcher stays a readable 15-ish lines.
- Error-type choices are consistent with the codebase — `InvalidFileFormatError` for format problems, `ArtifactStorageError` for "no batch files" is the right split. And the docstring → raised-type fix in commit `50a93d98` is a good correctness catch before merge.
- Extension-inference-with-override — `results.export("output.csv")` vs. `results.export("output.txt", format="jsonl")` is a clean API. `SUPPORTED_EXPORT_FORMATS` derived from `get_args(ExportFormat)` means the Literal and tuple can't drift.
- Test coverage is thorough — parametrized format coverage, content round-trips, multi-batch streaming, schema unification, extension inference, explicit override, empty batch dir, unsupported format, and the controller-level happy/sad paths. Good mix.
Verdict
Needs changes — the pre-export load_dataset() on line 161 breaks the PR's own memory-safety claim for the CLI path and should be fixed in this PR. The redundant controller validation is a smaller cleanup worth bundling in. The rest are genuine take-it-or-leave-it suggestions.
This review was generated by an AI assistant.
…g, UX

- Replace `load_dataset()` with `count_records()` in CLI to avoid OOM on large datasets; add `count_records()` method using `pq.read_metadata` (reads file metadata only, no data pages loaded)
- Remove redundant format validation in controller — `click.Choice` in `create.py` already rejects invalid values at parse time; dead code removed along with corresponding test
- Wrap `pa.unify_schemas` / `table.cast` `ArrowInvalid` as `InvalidFileFormatError` to normalize third-party exceptions at module boundaries per AGENTS.md
- Lowercase file extension before format lookup so `.JSONL` / `.CSV` / `.PARQUET` are accepted without error
- Add clarifying comment to trailing-newline guard in `_export_jsonl`
- Add tests: `count_records()`, uppercase extension, incompatible schemas
Summary
Closes #566
- `DatasetCreationResults.export(path, format=)` supporting jsonl, csv, and parquet formats
- `--output-format` / `-f` flag on the `data-designer create` CLI command with `click.Choice` validation; writes `dataset.<format>` alongside the parquet batch files
- `format=` is an optional explicit override (e.g. write a `.txt` file as JSONL)
- `SUPPORTED_EXPORT_FORMATS` is derived from `get_args(ExportFormat)` so the Literal and the tuple cannot drift apart
- Format errors raise `InvalidFileFormatError` (consistent with project error conventions)
- `date_format="iso"` for consistent datetime serialization across formats

Implementation notes

Streaming export (no OOM risk)

`export()` never materialises the full dataset in memory. Instead of calling `load_dataset()`, it reads the individual batch parquet files from `artifact_storage.final_dataset_path` one at a time and appends each to the output file, keeping peak memory proportional to a single batch regardless of dataset size. `pyarrow.parquet.ParquetWriter` is used to write each batch as a row group into one output file; schemas are unified with `pa.unify_schemas` upfront to handle type drift (e.g. `int64` vs `float64` in the same column across batches).

Format inference

CLI: `--output-format` / `-f` writes `dataset.<format>` alongside the parquet batch files.
Test plan
- `test_export_writes_file` — parametrized over all 3 formats
- `test_export_jsonl_content` — each line is valid JSON, all records present
- `test_export_csv_content` — single header row, full record count
- `test_export_parquet_content` — DataFrame round-trip across two batches
- `test_export_infers_format_from_extension` — format omitted, inferred from `.jsonl`
- `test_export_explicit_format_overrides_extension` — `.txt` written as JSONL
- `test_export_parquet_schema_unification` — two batches with diverging column types merged correctly
- `test_export_unknown_extension_raises` — raises `InvalidFileFormatError`
- `test_export_unsupported_explicit_format_raises` — raises `InvalidFileFormatError`
- `test_export_no_batch_files_raises` — raises `ArtifactStorageError`
- `test_export_returns_path_object` — returns `Path` for str input
- `test_run_create_with_output_format_happy_path` — export called with correct path
- `test_run_create_invalid_output_format_exits` — bad format exits before generation starts
- `test_run_create_export_failure_exits` — export exception exits with code 1

Controller tests cover the new `output_format` parameter.